World Bank Educational Data Analysis for academy EdTech Platform¶
This notebook analyzes World Bank EdStats data to identify international expansion opportunities for academy's EdTech platform. We'll explore educational indicators across countries to determine the most promising markets for expansion.
1. Loading and Initial Exploration¶
import os
import pandas as pd
import glob
# Get the current working directory (notebook's directory)
notebook_directory = os.getcwd()
# Relative path to the dataset directory
dataset_directory = 'Dataset'
dfs = {}
csv_files = glob.glob(os.path.join(notebook_directory, dataset_directory, "*.csv"))
for file_path in csv_files:
# Extract just the filename without extension
file_name = os.path.splitext(os.path.basename(file_path))[0]
df = pd.read_csv(file_path)
# Drop columns that are entirely empty
df = df.dropna(axis=1, how='all')
dfs[file_name] = df
print(f"Loaded: {file_name} ({df.shape[0]} rows, {df.shape[1]} columns)")
Loaded: EdStatsCountry-Series (613 rows, 3 columns) Loaded: EdStatsCountry (241 rows, 31 columns)
Loaded: EdStatsData (886930 rows, 69 columns)
Loaded: EdStatsFootNote (643638 rows, 4 columns) Loaded: EdStatsSeries (3665 rows, 15 columns)
2. Data Validation and Quality Assessment¶
Let's examine the data for quality issues, including missing values, duplicates, and other anomalies that might affect our analysis.
import pandas as pd
import matplotlib.pyplot as plt
import missingno as msno
def check_missing_values(dfs):
"""
Analyzes and visualizes missing values in each DataFrame in the provided dictionary.
Parameters:
- dfs (dict): Dictionary of DataFrames where keys are DataFrame names
"""
for df_name, df in dfs.items():
# Calculate missing values and percentages
missing = df.isnull().sum()
missing_percent = (missing / len(df)) * 100
missing_data = pd.DataFrame({
'Missing Values': missing,
'Percentage': missing_percent
}).sort_values('Percentage', ascending=False)
# Filter to only show columns with missing values
missing_data = missing_data[missing_data['Missing Values'] > 0]
# Print results
print(f"\n===== Missing Values in {df_name} =====")
if missing_data.empty:
print("No missing values found.")
else:
print(missing_data)
# Create visualization using missingno
if len(df) > 0: # Only create visualization if DataFrame is not empty
plt.figure(figsize=(12, 8))
msno.matrix(df.sample(min(1000, len(df))), figsize=(12, 8))
plt.title(f'Missing Values in {df_name}')
plt.show()
# Check for missing values
check_missing_values(dfs)
===== Missing Values in EdStatsCountry-Series ===== No missing values found.
<Figure size 1200x800 with 0 Axes>
===== Missing Values in EdStatsCountry =====
Missing Values Percentage
National accounts reference year 209 86.721992
Alternative conversion factor 194 80.497925
Other groups 183 75.933610
Latest industrial data 134 55.601660
Vital registration complete 130 53.941909
External debt Reporting status 117 48.547718
Latest household survey 100 41.493776
Latest agricultural census 99 41.078838
Lending category 97 40.248963
Special Notes 96 39.834025
PPP survey year 96 39.834025
Source of most recent Income and expenditure data 81 33.609959
Government Accounting concept 80 33.195021
Latest water withdrawal data 62 25.726141
Balance of Payments Manual in use 60 24.896266
IMF data dissemination standard 60 24.896266
Latest trade data 56 23.236515
SNA price valuation 44 18.257261
System of trade 41 17.012448
National accounts base year 36 14.937759
Latest population census 28 11.618257
Region 27 11.203320
Income Group 27 11.203320
Currency Unit 26 10.788382
System of National Accounts 26 10.788382
2-alpha code 3 1.244813
WB-2 code 1 0.414938
<Figure size 1200x800 with 0 Axes>
===== Missing Values in EdStatsData =====
Missing Values Percentage
2017 886787 99.983877
2016 870470 98.144160
1971 851393 95.993258
1973 851385 95.992356
1972 851311 95.984012
... ... ...
2011 740918 83.537370
2012 739666 83.396209
2000 710254 80.080051
2005 702822 79.242105
2010 644488 72.665036
[65 rows x 2 columns]
<Figure size 1200x800 with 0 Axes>
===== Missing Values in EdStatsFootNote ===== No missing values found.
<Figure size 1200x800 with 0 Axes>
===== Missing Values in EdStatsSeries =====
Missing Values Percentage
Development relevance 3662 99.918145
General comments 3651 99.618008
Limitations and exceptions 3651 99.618008
Statistical concept and methodology 3642 99.372442
Aggregation method 3618 98.717599
Periodicity 3566 97.298772
Related source links 3450 94.133697
Base Period 3351 91.432469
Other notes 3113 84.938608
Short definition 1509 41.173261
<Figure size 1200x800 with 0 Axes>
from src.scripts.data_validator import DataValidator
# Create an instance
validator = DataValidator()
# Generate metadata for all dataframes
metadata_dfs = validator.create_metadata_dfs(dfs)
# Display the metadata
validator.display_metadata_dfs(metadata_dfs)
# Check for duplicates
ignore_fields = {
'EdStatsCountry-Series': ['DESCRIPTION'],
'EdStatsCountry': ['Special Notes'],
'EdStatsFootNote': ['DESCRIPTION'],
'EdStatsSeries': ['Short definition']
}
mandatory_fields = {
'EdStatsCountry-Series': ['CountryCode','SeriesCode'],
'EdStatsCountry': ['Country Code','Table Name'],
'EdStatsData': ['Country Code','Indicator Code'],
'EdStatsFootNote': ['CountryCode','SeriesCode','Year'],
'EdStatsSeries': ['Series Code','Indicator Name']
}
duplicate_results = validator.check_duplicates(dfs, ignore_fields, mandatory_fields)
Metadata for metadata_EdStatsCountry-Series (613, 3):
Column Name Dtype Type Fill Percentage NaN Percentage \
0 CountryCode object str(3) 100.0 0.0
1 SeriesCode object str(17) 100.0 0.0
2 DESCRIPTION object str(278) 100.0 0.0
Bad Null Percentage
0 0.0
1 0.0
2 0.0
Metadata for metadata_EdStatsCountry (241, 31):
Column Name Dtype Type \
0 Country Code object str(3)
1 Short Name object str(46)
2 Table Name object str(46)
3 Long Name object str(73)
4 2-alpha code object str(2)
5 Currency Unit object str(42)
6 Special Notes object str(1008)
7 Region object str(26)
8 Income Group object str(20)
9 WB-2 code object str(2)
10 National accounts base year object str(119)
11 National accounts reference year float64 float(6)
12 SNA price valuation object str(36)
13 Lending category object str(5)
14 Other groups object str(9)
15 System of National Accounts object str(62)
16 Alternative conversion factor object str(18)
17 PPP survey year object str(7)
18 Balance of Payments Manual in use object str(44)
19 External debt Reporting status object str(11)
20 System of trade object str(20)
21 Government Accounting concept object str(31)
22 IMF data dissemination standard object str(42)
23 Latest population census object str(174)
24 Latest household survey object str(141)
25 Source of most recent Income and expenditure data object str(90)
26 Vital registration complete object str(48)
27 Latest agricultural census object str(36)
28 Latest industrial data float64 float(6)
29 Latest trade data float64 float(6)
30 Latest water withdrawal data object str(27)
Fill Percentage NaN Percentage Bad Null Percentage
0 100.000000 0.000000 0.0
1 100.000000 0.000000 0.0
2 100.000000 0.000000 0.0
3 100.000000 0.000000 0.0
4 98.755187 1.244813 0.0
5 89.211618 10.788382 0.0
6 60.165975 39.834025 0.0
7 88.796680 11.203320 0.0
8 88.796680 11.203320 0.0
9 99.585062 0.414938 0.0
10 85.062241 14.937759 0.0
11 13.278008 86.721992 0.0
12 81.742739 18.257261 0.0
13 59.751037 40.248963 0.0
14 24.066390 75.933610 0.0
15 89.211618 10.788382 0.0
16 19.502075 80.497925 0.0
17 60.165975 39.834025 0.0
18 75.103734 24.896266 0.0
19 51.452282 48.547718 0.0
20 82.987552 17.012448 0.0
21 66.804979 33.195021 0.0
22 75.103734 24.896266 0.0
23 88.381743 11.618257 0.0
24 58.506224 41.493776 0.0
25 66.390041 33.609959 0.0
26 46.058091 53.941909 0.0
27 58.921162 41.078838 0.0
28 44.398340 55.601660 0.0
29 76.763485 23.236515 0.0
30 74.273859 25.726141 0.0
Metadata for metadata_EdStatsData (886930, 69):
Column Name Dtype Type Fill Percentage NaN Percentage \
0 Country Name object str(50) 100.000000 0.000000
1 Country Code object str(3) 100.000000 0.000000
2 Indicator Name object str(164) 100.000000 0.000000
3 Indicator Code object str(30) 100.000000 0.000000
4 1970 float64 float(18) 8.150361 91.849639
.. ... ... ... ... ...
64 2080 float64 float(10) 5.799330 94.200670
65 2085 float64 float(10) 5.799330 94.200670
66 2090 float64 float(10) 5.799330 94.200670
67 2095 float64 float(10) 5.799330 94.200670
68 2100 float64 float(10) 5.799330 94.200670
Bad Null Percentage
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
.. ...
64 0.0
65 0.0
66 0.0
67 0.0
68 0.0
[69 rows x 6 columns]
Metadata for metadata_EdStatsFootNote (643638, 4):
Column Name Dtype Type Fill Percentage NaN Percentage \
0 CountryCode object str(3) 100.0 0.0
1 SeriesCode object str(30) 100.0 0.0
2 Year object str(6) 100.0 0.0
3 DESCRIPTION object str(1132) 100.0 0.0
Bad Null Percentage
0 0.0
1 0.0
2 0.0
3 0.0
Metadata for metadata_EdStatsSeries (3665, 15):
Column Name Dtype Type Fill Percentage \
0 Series Code object str(30) 100.000000
1 Topic object str(92) 100.000000
2 Indicator Name object str(162) 100.000000
3 Short definition object str(2192) 58.826739
4 Long definition object str(2192) 100.000000
5 Periodicity object str(6) 2.701228
6 Base Period object str(26) 8.567531
7 Other notes object str(39) 15.061392
8 Aggregation method object str(16) 1.282401
9 Limitations and exceptions object str(3275) 0.381992
10 General comments object str(699) 0.381992
11 Source object str(472) 100.000000
12 Statistical concept and methodology object str(896) 0.627558
13 Development relevance object str(2555) 0.081855
14 Related source links object str(36) 5.866303
NaN Percentage Bad Null Percentage
0 0.000000 0.0
1 0.000000 0.0
2 0.000000 0.0
3 41.173261 0.0
4 0.000000 0.0
5 97.298772 0.0
6 91.432469 0.0
7 84.938608 0.0
8 98.717599 0.0
9 99.618008 0.0
10 99.618008 0.0
11 0.000000 0.0
12 99.372442 0.0
13 99.918145 0.0
14 94.133697 0.0
DataFrame 'EdStatsCountry-Series': 0 raw duplicate rows found, 0 composite key duplicate rows found
DataFrame 'EdStatsCountry': 0 raw duplicate rows found, 0 composite key duplicate rows found
DataFrame 'EdStatsData': 0 raw duplicate rows found, 0 composite key duplicate rows found
DataFrame 'EdStatsFootNote': 0 raw duplicate rows found, 0 composite key duplicate rows found DataFrame 'EdStatsSeries': 0 raw duplicate rows found, 0 composite key duplicate rows found
3. Data Frame Simplification Based on Interesting KPIs¶
Now that we have a global overview of the structure and quality of the datasets, we will create a subset of data by focusing on the initial business need.
- "We are an online education startup that wants to expand internationally, targeting French/English-speaking audiences aged 15 to 24 for high school to university level training."
3.1 Filtering by Excluding Non-Relevant KPIs¶
We will code the filtering script to exclude non-relevant indicators based on our target audience and specific needs.
Below is a script that filters the educational data to focus only on the most relevant indicators for our target demographic (French/English-speaking audiences aged 15-24 in high school to university settings):
import re
def filter_kpis(EdStatsSeries):
# List of terms to ignore in the 'Topic'
ignore_list = ['Economic Policy', 'Education Equality', 'SABER', 'Early Childhood Education', 'Health', 'Social Protection', 'Population', 'Teachers', 'Pre-Primary', 'Primary', 'Learning Outcomes', 'Literacy']
ignore_pattern = '|'.join(ignore_list).replace('%', '.*')
# List of terms to ignore in the 'Indicator Name' for lower school levels
school_level_ignore_list = ['1st', '2nd', '3rd', '4th', '5th', '6th', 'primary', 'No Education', 'Grade']
school_level_ignore_pattern = '|'.join(school_level_ignore_list)
# List of terms to ignore for gender-related entries
gender_ignore_list = ['Female', 'Male', 'gender parity index', 'male (%)', 'toilets', 'male (number)', 'male (years)']
gender_ignore_pattern = '|'.join(re.escape(term) for term in gender_ignore_list)
# Define the pattern to exclude entries related to ages over 24 or under 15
age_pattern = r'age\s*(?:[0-9]|1[0-4]|2[5-9]|[3-9]\d|\d{3,})|\bage\s*(?:[0-9]\+|1[0-4]\+|2[5-9]\+|[3-9]\d\+|\d{3,}\+)|\bage\s*(?:[0-9]-\d{2,}|1[0-4]-\d{2,}|2[5-9]-\d{2,}|[3-9]\d-\d{2,}|\d{3,}-\d{2,})'
# Define the pattern to keep rows with 'Learning Outcomes' if 'Indicator Name' contains 'French' or 'English'
language_pattern = r'French|English'
# Filter rows based on 'Topic' and 'Indicator Name', while keeping 'Learning Outcomes' with 'French' or 'English'
filtered_df = EdStatsSeries[
(~EdStatsSeries['Topic'].str.contains(ignore_pattern, regex=True) |
((EdStatsSeries['Topic'] == 'Learning Outcomes') & EdStatsSeries['Indicator Name'].str.contains(language_pattern, regex=True, case=False))) &
~EdStatsSeries['Indicator Name'].str.contains(age_pattern, regex=True, case=False) &
~EdStatsSeries['Indicator Name'].str.contains(school_level_ignore_pattern, regex=True, case=False) &
~EdStatsSeries['Indicator Name'].str.contains(gender_ignore_pattern, regex=True, case=False)
][['Topic', 'Series Code', 'Indicator Name', 'Short definition']].drop_duplicates()
return filtered_df
# call the function
df_filtered_kpis = filter_kpis(dfs["EdStatsSeries"])
# Display the shape of the filtered DataFrame
print(f"Shape of the filtered DataFrame: {df_filtered_kpis.shape}")
Shape of the filtered DataFrame: (190, 4)
Fantastic! We've gone from a list of 3665 indicators down to just 190. This will make it much easier to make a selection.
3.2 Manual Filtering Based on a Scenario¶
For online training, learners need:¶
- Good quality internet connection
- A device to access it (PC, smartphone, tablet, etc.) => Search for one or more indicators on these infrastructure subjects
For easy international expansion, a common language between learners and the training center is needed¶
=> Search for an indicator on language proficiency in French (or English)
We need a socio-economic context and the ability to project with:¶
- Information on demographics and their evolution over at least 5-10 years in the target population segment
- Information on the evolution of education continuation rates for high school/university levels
- Information on the evolution of the education budget in the targeted segment
- Information on the evolution of the country's GDP => Search for a set of indicators to have a socio-economic vision and understand market opportunities generically
And the cherry on top:¶
- Any indicator that helps better understand the needs and desires of the future generation to ensure that the company's values align with the future market => Maximum 2 indicators to tell a story
Categories and Top KPIs¶
| Category | Indicator | Code |
|---|---|---|
| 1. Internet Connection & Device Access | Internet users (per 100 people) | IT.NET.USER.P2 |
| Personal computers (per 100 people) | IT.CMP.PCMP.P2 | |
| 2. Language Proficiency | Total inbound internationally mobile students | UIS.MS.56.T |
| Net flow ratio of internationally mobile students (%) | UIS.MENFR.56 | |
| Inbound mobility rate (%) | UIS.MSEP.56 | |
| 3. Socio-economic Context | ||
| 3.1 Demographics | Projection: Population with Lower Secondary education (thousands) | PRJ.POP.ALL.2.MF |
| Projection: Population with Upper Secondary education (thousands) | PRJ.POP.ALL.3.MF | |
| Projection: Population with Post Secondary education (thousands) | PRJ.POP.ALL.4.MF | |
| 3.2 Education Continuation | Gross enrollment ratio, secondary (%) | SE.SEC.ENRR |
| Gross enrollment ratio, tertiary (%) | SE.TER.ENRR | |
| Graduates from tertiary education (number) | SE.TER.GRAD | |
| 3.3 Education Budget | Expenditure on secondary as % of government education expenditure | SE.XPD.SECO.ZS |
| Expenditure on tertiary as % of government education expenditure | SE.XPD.TERT.ZS | |
| Government expenditure on education as % of GDP | SE.XPD.TOTL.GD.ZS | |
| 3.4 GDP | (To be sourced from external data) | - |
| 4. Future Generation Needs | Share of youth not in education, employment or training (%) | SL.UEM.NEET.ZS |
| Percentage of students in secondary vocational programs (%) | SE.SEC.ENRL.VO.ZS |
# List of Series Codes based on selected indicators
codeserielist = [
"IT.NET.USER.P2", # Internet users (per 100 people)
"IT.CMP.PCMP.P2", # Personal computers (per 100 people)
"UIS.MS.56.T", # Total inbound internationally mobile students, both sexes
"UIS.MENFR.56", # Net flow ratio of internationally mobile students (inbound - outbound), both sexes (%)
"UIS.MSEP.56", # Inbound mobility rate, both sexes (%)
"PRJ.POP.ALL.2.MF", # Projection: Population in thousands by highest level of educational attainment. Lower Secondary. Total
"PRJ.POP.ALL.3.MF", # Projection: Population in thousands by highest level of educational attainment. Upper Secondary. Total
"PRJ.POP.ALL.4.MF", # Projection: Population in thousands by highest level of educational attainment. Post Secondary. Total
"SE.SEC.ENRR", # Gross enrolment ratio, secondary, both sexes (%)
"SE.TER.ENRR", # Gross enrolment ratio, tertiary, both sexes (%)
"SE.TER.GRAD", # Graduates from tertiary education, both sexes (number)
"SE.XPD.SECO.ZS", # Expenditure on secondary as % of government expenditure on education (%)
"SE.XPD.TERT.ZS", # Expenditure on tertiary as % of government expenditure on education (%)
"SE.XPD.TOTL.GD.ZS", # Government expenditure on education as % of GDP (%)
"SL.UEM.NEET.ZS", # Share of youth not in education, employment or training, total (% of youth population)
"SE.SEC.ENRL.VO.ZS" # Percentage of students in secondary education enrolled in vocational programmes, both sexes (%)
]
# Filter the data based on the provided list of Series Codes
df_kpis = df_filtered_kpis[df_filtered_kpis["Series Code"].isin(codeserielist)]
df_kpis
| Topic | Series Code | Indicator Name | Short definition | |
|---|---|---|---|---|
| 610 | Infrastructure: Communications | IT.CMP.PCMP.P2 | Personal computers (per 100 people) | NaN |
| 611 | Infrastructure: Communications | IT.NET.USER.P2 | Internet users (per 100 people) | NaN |
| 1981 | Attainment | PRJ.POP.ALL.2.MF | Projection: Population in thousands by highest... | NaN |
| 1984 | Attainment | PRJ.POP.ALL.3.MF | Projection: Population in thousands by highest... | NaN |
| 1987 | Attainment | PRJ.POP.ALL.4.MF | Projection: Population in thousands by highest... | NaN |
| 2306 | Secondary | SE.SEC.ENRL.VO.ZS | Percentage of students in secondary education ... | NaN |
| 2307 | Secondary | SE.SEC.ENRR | Gross enrolment ratio, secondary, both sexes (%) | NaN |
| 2335 | Tertiary | SE.TER.ENRR | Gross enrolment ratio, tertiary, both sexes (%) | NaN |
| 2338 | Tertiary | SE.TER.GRAD | Graduates from tertiary education, both sexes ... | NaN |
| 2377 | Expenditures | SE.XPD.SECO.ZS | Expenditure on secondary as % of government ex... | Expenditure on education by level of education... |
| 2379 | Expenditures | SE.XPD.TERT.ZS | Expenditure on tertiary as % of government exp... | Expenditure on education by level of education... |
| 2381 | Expenditures | SE.XPD.TOTL.GD.ZS | Government expenditure on education as % of GD... | Total general (local, regional and central) go... |
| 2397 | Laber | SL.UEM.NEET.ZS | Share of youth not in education, employment or... | Share of youth not in education, employment or... |
| 3074 | Tertiary | UIS.MENFR.56 | Net flow ratio of internationally mobile stude... | NaN |
| 3076 | Tertiary | UIS.MS.56.T | Total inbound internationally mobile students,... | NaN |
| 3077 | Tertiary | UIS.MSEP.56 | Inbound mobility rate, both sexes (%) | NaN |
3.3 Applying Filtering to our DataFrames¶
Once the relevant KPIs are identified, we will filter our DataFrames to keep only the columns of interest.
Reduction of EdStatsSeries¶
reduced_datasets = {}
# 1. Reduce EdStatsSeries
if "EdStatsSeries" in dfs:
reduced_datasets["EdStatsSeries_reduced"] = dfs["EdStatsSeries"][
dfs["EdStatsSeries"]["Series Code"].isin(df_kpis["Series Code"])
]
print(f"EdStatsSeries_reduced: {reduced_datasets['EdStatsSeries_reduced'].shape}")
EdStatsSeries_reduced: (16, 15)
Reduction of EdStatsCountry¶
EdStatsCountry_reduced improves visibility by focusing on the necessary and sufficient columns for our study. Additionally, we need to add a tag to distinguish between groups and countries.
# 2. Reduce EdStatsCountry
if "EdStatsCountry" in dfs:
important_columns = [
'Country Code', 'Short Name', 'Table Name', 'Long Name',
'Region', 'Income Group'
]
available_columns = [col for col in important_columns if col in dfs["EdStatsCountry"].columns]
reduced_datasets["EdStatsCountry_reduced"] = dfs["EdStatsCountry"][available_columns].copy()
# Add a 'Category' column
if 'Income Group' in reduced_datasets["EdStatsCountry_reduced"].columns:
reduced_datasets["EdStatsCountry_reduced"]['Category'] = reduced_datasets["EdStatsCountry_reduced"]['Income Group'].apply(
lambda x: 'Country' if pd.notna(x) and 'income' in str(x).lower() else 'Unknown'
)
print(f"EdStatsCountry_reduced: {reduced_datasets['EdStatsCountry_reduced'].shape}")
EdStatsCountry_reduced: (241, 7)
Reduction of EdStatsData and EdStatsFootNote¶
EdStatsData_reduced: Contains only theIndicator Namepresent inEdStatsSeries_reduced.EdStatsFootNote_reduced: Contains only theSeries Codepresent inEdStatsSeries_reduced.
# 3. Reduce EdStatsData
if "EdStatsData" in dfs:
# First filter by indicator codes
data_reduced = dfs["EdStatsData"][dfs["EdStatsData"]["Indicator Code"].isin(df_kpis["Series Code"])]
# Then filter by countries if available
if "EdStatsCountry_reduced" in reduced_datasets and "Country Code" in data_reduced.columns:
country_codes = reduced_datasets["EdStatsCountry_reduced"][
reduced_datasets["EdStatsCountry_reduced"]['Category'] == 'Country'
]['Country Code'] if 'Category' in reduced_datasets["EdStatsCountry_reduced"].columns else None
if country_codes is not None:
data_reduced = data_reduced[data_reduced['Country Code'].isin(country_codes)]
# Keep only necessary columns
necessary_columns = ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code']
year_columns = [str(year) for year in range(2000, 2036) if str(year) in data_reduced.columns]
available_columns = [col for col in necessary_columns + year_columns if col in data_reduced.columns]
reduced_datasets["EdStatsData_reduced"] = data_reduced[available_columns]
print(f"EdStatsData_reduced: {reduced_datasets['EdStatsData_reduced'].shape}")
# 4. Reduce EdStatsFootNote if available
if "EdStatsFootNote" in dfs:
if "SeriesCode" in dfs["EdStatsFootNote"].columns:
reduced_datasets["EdStatsFootNote_reduced"] = dfs["EdStatsFootNote"][
dfs["EdStatsFootNote"]["SeriesCode"].isin(df_kpis["Series Code"])
]
print(f"EdStatsFootNote_reduced: {reduced_datasets['EdStatsFootNote_reduced'].shape}")
EdStatsData_reduced: (3424, 26)
EdStatsFootNote_reduced: (21323, 4)
5. Graphical Analysis¶
5.1 Analysis Objective¶
The objective of this graphical analysis is to provide a visual overview of key performance indicators (KPIs) by country. This approach allows for quickly identifying trends and anomalies across different countries and indicators.
5.2 Methodology¶
- Data Aggregation: For each country and each KPI, we use the latest available value to ensure relevant and up-to-date analysis.
- Heatmaps: Heatmaps are used to visualize data intuitively and identify strengths and weaknesses by country and indicator.
# Import necessary modules
from src.classes.analyze_edstats import EdStatsVisualizer
# Initialize the visualizer with your reduced datasets
visualizer = EdStatsVisualizer(
data_reduced=reduced_datasets["EdStatsData_reduced"],
series_reduced=reduced_datasets["EdStatsSeries_reduced"],
country_reduced=reduced_datasets["EdStatsCountry_reduced"]
)
# Extract latest values
year_columns = [str(year) for year in range(2000, 2036) if str(year) in reduced_datasets["EdStatsData_reduced"].columns]
latest_values = visualizer.extract_latest_values(year_columns)
# Calculate statistics
stats = visualizer.calculate_statistics()
print("Statistics summary:")
stats
Statistics summary:
| Indicator Name | Indicator Code | Mean | Median | Std Dev | Count | |
|---|---|---|---|---|---|---|
| 0 | Personal computers (per 100 people) | IT.CMP.PCMP.P2 | 17.019329 | 7.829152 | 2.218622e+01 | 190 |
| 1 | Internet users (per 100 people) | IT.NET.USER.P2 | 50.933562 | 53.226475 | 2.851759e+01 | 204 |
| 2 | Projection: Population in thousands by highest... | PRJ.POP.ALL.2.MF | 8367.802651 | 1220.155000 | 4.350438e+04 | 166 |
| 3 | Projection: Population in thousands by highest... | PRJ.POP.ALL.3.MF | 12622.268373 | 2362.300000 | 4.186799e+04 | 166 |
| 4 | Projection: Population in thousands by highest... | PRJ.POP.ALL.4.MF | 7295.463675 | 1232.295000 | 2.237595e+04 | 166 |
| 5 | Percentage of students in secondary education ... | SE.SEC.ENRL.VO.ZS | 13.422130 | 8.329990 | 1.370300e+01 | 181 |
| 6 | Gross enrolment ratio, secondary, both sexes (%) | SE.SEC.ENRR | 81.990520 | 88.367233 | 2.908839e+01 | 190 |
| 7 | Gross enrolment ratio, tertiary, both sexes (%) | SE.TER.ENRR | 37.330985 | 29.536000 | 2.804407e+01 | 184 |
| 8 | Graduates from tertiary education, both sexes ... | SE.TER.GRAD | 265176.900506 | 33843.500000 | 1.163433e+06 | 170 |
| 9 | Expenditure on secondary as % of government ex... | SE.XPD.SECO.ZS | 33.392157 | 33.680889 | 1.032088e+01 | 165 |
| 10 | Expenditure on tertiary as % of government exp... | SE.XPD.TERT.ZS | 20.544842 | 20.762760 | 9.307898e+00 | 171 |
| 11 | Government expenditure on education as % of GD... | SE.XPD.TOTL.GD.ZS | 4.732653 | 4.541055 | 2.128752e+00 | 182 |
| 12 | Share of youth not in education, employment or... | SL.UEM.NEET.ZS | 18.744118 | 17.750000 | 8.288301e+00 | 68 |
| 13 | Net flow ratio of internationally mobile stude... | UIS.MENFR.56 | -101.470850 | -2.605710 | 1.022365e+03 | 153 |
| 14 | Total inbound internationally mobile students,... | UIS.MS.56.T | 27740.149046 | 3398.000000 | 8.521866e+04 | 154 |
| 15 | Inbound mobility rate, both sexes (%) | UIS.MSEP.56 | 7.769275 | 2.845035 | 1.484738e+01 | 152 |
# Create the interactive plot with dropdown selector
distribution_plot_with_selector = visualizer.create_distribution_plots_with_selector()
distribution_plot_with_selector.show()
# Import the new classes
from src.classes.visualization import HeatmapVisualizer, ComparisonVisualizer
# Create and display a heatmap with income group dropdown
heatmap_fig = HeatmapVisualizer.create_heatmap(
latest_values=visualizer.latest_values,
series_data=reduced_datasets["EdStatsSeries_reduced"],
country_data=reduced_datasets["EdStatsCountry_reduced"]
)
heatmap_fig.show()
# Create and display income group comparison with indicator dropdown
top_countries, enriched_stats = visualizer.select_top_countries(top_n=5)
comparison_fig = ComparisonVisualizer.create_income_group_comparison(
enriched_stats=enriched_stats,
series_data=reduced_datasets["EdStatsSeries_reduced"],
country_data=reduced_datasets["EdStatsCountry_reduced"]
)
comparison_fig.show()
# Select top countries by income group
top_countries, enriched_stats = visualizer.select_top_countries(top_n=20)
print("Top countries by indicator and income group:")
enriched_stats
Top countries by indicator and income group:
| Indicator Name | Indicator Code | Mean | Median | Std Dev | Count | Income Group | Country Code | Country Rank | Value | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Personal computers (per 100 people) | IT.CMP.PCMP.P2 | 17.019329 | 7.829152 | 22.186224 | 190.0 | NaN | NaN | NaN | NaN |
| 1 | Internet users (per 100 people) | IT.NET.USER.P2 | 50.933562 | 53.226475 | 28.517594 | 204.0 | NaN | NaN | NaN | NaN |
| 2 | Projection: Population in thousands by highest... | PRJ.POP.ALL.2.MF | 8367.802651 | 1220.155000 | 43504.380262 | 166.0 | NaN | NaN | NaN | NaN |
| 3 | Projection: Population in thousands by highest... | PRJ.POP.ALL.3.MF | 12622.268373 | 2362.300000 | 41867.993997 | 166.0 | NaN | NaN | NaN | NaN |
| 4 | Projection: Population in thousands by highest... | PRJ.POP.ALL.4.MF | 7295.463675 | 1232.295000 | 22375.954613 | 166.0 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1611 | Inbound mobility rate, both sexes (%) | UIS.MSEP.56 | NaN | NaN | NaN | NaN | High income: OECD | SWE | 16.0 | 5.82612 |
| 1612 | Inbound mobility rate, both sexes (%) | UIS.MSEP.56 | NaN | NaN | NaN | NaN | High income: OECD | SVK | 17.0 | 4.85962 |
| 1613 | Inbound mobility rate, both sexes (%) | UIS.MSEP.56 | NaN | NaN | NaN | NaN | High income: OECD | ITA | 18.0 | 4.40275 |
| 1614 | Inbound mobility rate, both sexes (%) | UIS.MSEP.56 | NaN | NaN | NaN | NaN | High income: OECD | GRC | 19.0 | 4.18636 |
| 1615 | Inbound mobility rate, both sexes (%) | UIS.MSEP.56 | NaN | NaN | NaN | NaN | High income: OECD | USA | 20.0 | 3.92751 |
1616 rows × 10 columns
6. Scoring System Development¶
KPI Selection and Weighting¶
| Category | KPI | Weight |
|---|---|---|
| Internet Connection and Device Access (30%) | Internet users (per 100 people) | 15% |
| Personal computers (per 100 people) | 15% | |
| Language Proficiency (20%) | Total number of inbound internationally mobile students | 10% |
| Net flow ratio of internationally mobile students | 5% | |
| Inbound mobility rate | 5% | |
| Socio-economic Context (40%) | Population projections (secondary, higher, post-secondary) | 10% |
| Gross enrollment ratio (secondary and tertiary) | 10% | |
| Tertiary education graduates | 5% | |
| Expenditure on secondary and tertiary education (% of GDP) | 10% | |
| Total expenditure on education (% of GDP) | 5% | |
| Future Generation Needs (10%) | Share of youth not in employment, education, or training | 5% (lower is better) |
| Percentage of secondary students in vocational programs | 5% |
Implementation Plan¶
- Data Preparation: Normalize indicators to a comparable scale (0-10)
- Score Calculation: Apply weights to normalized values and sum for total score
- Visualization: Create dashboard showing:
- Country rankings by income group
- Radar charts comparing KPI performance
- Highlight top opportunities in each region
This scoring system will help identify the most promising countries for Academy's online education expansion, balancing technological readiness, educational demand, and economic factors.
from src.classes.scoring import EdStatsScorer
# Initialize the scorer with country data
scorer = EdStatsScorer(country_data=reduced_datasets["EdStatsCountry_reduced"])
# Create and display the KPI weight visualization
weight_viz = scorer.create_kpi_weight_visualization()
weight_viz.show()
# Calculate scores using the enriched statistics dataframe
scores_df = scorer.calculate_scores(enriched_stats)
# Create and display the income group dashboard
dashboard_fig = scorer.create_income_group_dashboard(top_n=5)
dashboard_fig.show()
# Create the interactive radar chart with top 10 countries
radar_chart = scorer.create_interactive_radar_chart(top_n=20)
# Display the chart
radar_chart.show()
# Initialize the scorer with country data AND visualizer
scorer = EdStatsScorer(
country_data=reduced_datasets["EdStatsCountry_reduced"],
visualizer=visualizer # Pass the visualizer reference here
)
# Calculate scores using the enriched statistics dataframe
scores_df = scorer.calculate_scores(enriched_stats)
# Now create the trend charts
trend_chart = scorer.create_kpi_trend_charts(top_n=5)
trend_chart.show()